An efficiency curve for evaluating imbalanced classifiers considering intrinsic data characteristics: Experimental analysis
Balancing the accuracy rates of the majority and minority classes is challenging in imbalanced
classification. Furthermore, data characteristics have a significant impact on the performance
of imbalanced classifiers, which are generally neglected by existing evaluation
methods. The objective of this study is to introduce a new criterion to comprehensively
evaluate imbalanced classifiers. Specifically, we introduce an efficiency curve that is established
using data envelopment analysis without explicit inputs (DEA-WEI), to determine
the trade-off between the benefits of improved minority class accuracy and the cost of
reduced majority class accuracy. We then analyze the impact of the imbalance
ratio and typical imbalanced data characteristics on the efficiency of the classifiers.
Empirical analyses using 68 imbalanced datasets reveal that traditional classifiers such as
C4.5 and the k-nearest neighbor are more effective on disjunct data, whereas ensemble
and undersampling techniques are more effective for overlapping and noisy data. The efficiency
of cost-sensitive classifiers decreases dramatically when the imbalance ratio
increases. Finally, we investigate the reasons for the different efficiencies of classifiers on
imbalanced data and recommend steps to select appropriate classifiers for imbalanced data
based on data characteristics.
National Natural Science Foundation of China (NSFC) 71874023, 71725001, 71771037, 7197104
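The two quantities that the abstract trades off against each other, minority class accuracy and majority class accuracy, can be computed as in the minimal sketch below. It assumes a binary problem and the common definition of the imbalance ratio as majority count divided by minority count; the function name `class_rates` and the toy labels are hypothetical, and the sketch does not implement the DEA-WEI efficiency curve itself.

```python
def class_rates(y_true, y_pred, minority):
    """Per-class accuracy rates and imbalance ratio for a binary problem.

    Assumes every label other than `minority` belongs to the majority class.
    """
    n_min = sum(1 for t in y_true if t == minority)
    n_maj = len(y_true) - n_min
    if n_min == 0 or n_maj == 0:
        raise ValueError("both classes must be present in y_true")
    # Correctly classified minority examples (true positives).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == minority and p == minority)
    # Correctly classified majority examples (true negatives).
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != minority and p != minority)
    return {
        "minority_accuracy": tp / n_min,
        "majority_accuracy": tn / n_maj,
        "imbalance_ratio": n_maj / n_min,
    }


# Toy example: 2 minority (label 1) and 8 majority (label 0) instances.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
rates = class_rates(y_true, y_pred, minority=1)
```

A classifier that raises `minority_accuracy` usually does so at some cost in `majority_accuracy`; an evaluation criterion such as the one proposed has to weigh the two.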
Hierarchical fuzzy rule based classification systems with genetic rule selection for imbalanced data-sets
In many real application areas, the data used are highly skewed and the number of
instances for some classes is much higher than that of the other classes. Solving a classification
task using such an imbalanced data-set is difficult due to the bias of the training
towards the majority classes.
The aim of this paper is to improve the performance of fuzzy rule based classification systems
on imbalanced domains, increasing the granularity of the fuzzy partitions on the
boundary areas between the classes, in order to obtain a better separability. We propose
the use of a hierarchical fuzzy rule based classification system, which is based on the
refinement of a simple linguistic fuzzy model by means of the extension of the structure
of the knowledge base in a hierarchical way and the use of a genetic rule selection process
in order to get a compact and accurate model.
The good performance of this approach is shown through an extensive experimental
study carried out over a large collection of imbalanced data-sets.
Spanish Ministry of Education and Science (MEC) under Projects TIN-2005-08386-C05-01 and TIN-2005-08386-C05-0
An Analysis of the Rule Weights and Fuzzy Reasoning Methods for Linguistic Rule Based Classification Systems Applied to Problems with Highly Imbalanced Data Sets
In this contribution we carry out an analysis of the rule
weights and Fuzzy Reasoning Methods for Fuzzy Rule Based Classification
Systems in the framework of imbalanced data-sets with a high
imbalance degree. We analyze the behaviour of the Fuzzy Rule Based
Classification Systems, searching for the best configuration of rule weight
and Fuzzy Reasoning Method, and also study the cooperation of some
instance pre-processing methods. To do so we use a simple rule base
obtained with the Chi (and co-authors’) method, which extends the well-known
Wang and Mendel method to classification problems.
The results obtained show the necessity to apply an instance preprocessing
step and the clear differences in the use of the rule weight
and Fuzzy Reasoning Method.
Finally, it is shown empirically that Fuzzy Rule Based Classification
Systems outperform the 1-NN and C4.5 classifiers in the framework of
highly imbalanced data-sets.
Spanish Projects TIN-2005-08386-C05-01 & TIC-2005-08386-C05-0
Why Linguistic Fuzzy Rule Based Classification Systems perform well in Big Data Applications?
The significance of addressing Big Data applications is beyond all doubt. The current ability to extract interesting knowledge from large volumes of information provides great advantages to both corporations and academia. Therefore, researchers and practitioners must deal with the problem of scalability so that Machine Learning and Data Mining algorithms can address Big Data properly. To this end, the MapReduce programming framework is by far the most widely used mechanism to implement fault-tolerant distributed applications. This framework implies the design of a divide-and-conquer mechanism in which local models are learned separately in one stage (Map tasks), whereas a second stage (Reduce) is devoted to aggregating all sub-models into a single solution. In this paper, we focus on the analysis of the behavior of Linguistic Fuzzy Rule Based Classification Systems when embedded into a MapReduce working procedure. By retrieving different information regarding the rules learned throughout the MapReduce process, we will be able to identify some of the capabilities of this particular paradigm that allow it to provide good performance when addressing Big Data problems. In summary, we will show that linguistic fuzzy classifiers are a robust approach under scalability requirements.
This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P and TIN2015-68454-R
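The divide-and-conquer scheme described in the abstract (local models learned in Map tasks, then aggregated in a Reduce stage) can be illustrated with a small single-process sketch. Everything here is an assumption made for illustration: the three-label triangular partition on [0, 1], the count-based rule support, and the names `map_task`, `reduce_task` and `finalize` are hypothetical and are not taken from the paper or from any MapReduce library.

```python
from collections import defaultdict
from functools import reduce

LABELS = ("low", "medium", "high")
PEAKS = (0.0, 0.5, 1.0)


def label(x):
    """Linguistic label whose triangular fuzzy set gives x (in [0, 1]) the highest membership."""
    memberships = [max(0.0, 1.0 - abs(x - peak) / 0.5) for peak in PEAKS]
    return LABELS[memberships.index(max(memberships))]


def map_task(partition):
    """Map stage: learn a local rule base, antecedent -> {class: support count}."""
    rules = defaultdict(lambda: defaultdict(int))
    for features, cls in partition:
        antecedent = tuple(label(x) for x in features)
        rules[antecedent][cls] += 1
    return {a: dict(counts) for a, counts in rules.items()}


def reduce_task(rb_a, rb_b):
    """Reduce stage: merge two rule bases, summing support for shared antecedents."""
    merged = {a: dict(counts) for a, counts in rb_a.items()}
    for antecedent, counts in rb_b.items():
        target = merged.setdefault(antecedent, {})
        for cls, n in counts.items():
            target[cls] = target.get(cls, 0) + n
    return merged


def finalize(rule_base):
    """Resolve conflicting consequents by keeping the class with the largest support."""
    return {a: max(counts, key=counts.get) for a, counts in rule_base.items()}


# Two data partitions, as if each were handled by a separate Map task.
partitions = [
    [((0.1, 0.9), "neg"), ((0.2, 0.8), "neg"), ((0.9, 0.1), "pos")],
    [((0.1, 0.9), "pos"), ((0.15, 0.95), "neg"), ((0.5, 0.5), "pos")],
]
local_rule_bases = [map_task(p) for p in partitions]
model = finalize(reduce(reduce_task, local_rule_bases))
```

Because each rule is a readable antecedent/consequent pair, rule bases from different partitions can be merged by simple bookkeeping, which is one plausible reading of why this family of models fits the MapReduce pattern.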
Multiplex Analysis of CircRNAs from Plasma Extracellular Vesicle-Enriched Samples for the Detection of Early-Stage Non-Small Cell Lung Cancer
Background: The analysis of liquid biopsies brings new opportunities in the precision
oncology field. Under this context, extracellular vesicle circular RNAs (EV-circRNAs) have gained
interest as biomarkers for lung cancer (LC) detection. However, standardized and robust protocols
need to be developed to boost their potential in the clinical setting. Although nCounter has been
used for the analysis of other liquid biopsy substrates and biomarkers, it has never been employed
for EV-circRNA analysis of LC patients. Methods: EVs were isolated from early-stage LC patients
(n = 36) and controls (n = 30). Different volumes of plasma, together with different numbers of pre-amplification
cycles, were tested to reach the best nCounter outcome. Differential expression analysis
of circRNAs was performed, along with the testing of different machine learning (ML) methods for
the development of a prognostic signature for LC. Results: A combination of 500 µL of plasma input
with 10 cycles of pre-amplification was selected for the rest of the study. Eight circRNAs were found
upregulated in LC. Further ML analysis selected a 10-circRNA signature able to discriminate LC from
controls with AUC ROC of 0.86. Conclusions: This study validates the use of the nCounter platform
for multiplexed EV-circRNA expression studies in LC patient samples, allowing the development of
prognostic signatures.
European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant 76549
IIVFDT: Ignorance Functions based Interval-Valued Fuzzy Decision Tree with Genetic Tuning
The choice of membership functions plays an essential role in the success of fuzzy systems. This is a complex problem due to the possible lack of knowledge when assigning punctual values as membership degrees. To address this difficulty, we propose a methodology called Ignorance functions based Interval-Valued Fuzzy Decision Tree with genetic tuning, IIVFDT for short, which improves the performance of fuzzy decision trees by taking into account the ignorance degree. This ignorance degree is the result of a weak ignorance function applied to the punctual value set as membership degree.
Our IIVFDT proposal is composed of four steps: (1) the base fuzzy decision tree is generated using the fuzzy ID3 algorithm; (2) the linguistic labels are modeled with Interval-Valued Fuzzy Sets. To do so, a new parametrized construction method of Interval-Valued Fuzzy Sets is defined, whose length represents such ignorance degree; (3) the fuzzy reasoning method is extended to work with this representation of the linguistic terms; (4) an evolutionary tuning step is applied for computing the optimal ignorance degree for each Interval-Valued Fuzzy Set.
The experimental study shows that the IIVFDT method outperforms the initial fuzzy ID3 both with and without Interval-Valued Fuzzy Sets. The suitability of the proposed methodology is shown with respect to both several state-of-the-art fuzzy decision trees and C4.5. Furthermore, we analyze the quality of our approach versus two methods that learn the fuzzy decision tree using genetic algorithms. Finally, we show that superior performance can be achieved by means of the positive synergy obtained when applying the well-known genetic tuning of the lateral position after the application of the IIVFDT method.
Spanish Government TIN2011-28488, TIN2010-1505
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is
considered the "de facto" standard in the framework of learning from imbalanced data. This
is due to its simplicity in the design of the procedure, as well as its robustness when applied
to different types of problems. Since its publication in 2002, SMOTE has proven
successful in a variety of applications from several different domains. SMOTE has also inspired
several approaches to counter the issue of class imbalance, and has also significantly
contributed to new supervised learning paradigms, including multilabel classification, incremental
learning, semi-supervised learning, multi-instance learning, among others. It is a
standard benchmark for learning from imbalanced data. It is also featured in a number of
different software packages, from open source to commercial. In this paper, marking the
fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current
state of affairs with SMOTE, its applications, and also identify the next set of challenges
to extend SMOTE for Big Data problems.
This work has been partially supported by the Spanish Ministry of Science and Technology
under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project
887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016;
and the National Science Foundation (NSF) Grant IIS-1447795
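For readers unfamiliar with the technique, the core idea of SMOTE is to create each synthetic minority example by interpolating between a minority instance and one of its k nearest minority neighbours. The sketch below is a simplified, standard-library-only variant written for illustration: it picks the base instance at random for each synthetic sample, and the function name `smote`, its parameters and the toy data are assumptions rather than the API of any existing package.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating towards nearest neighbours.

    `minority` is a list of equal-length numeric tuples (at least two of them).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base sample.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        # Random point on the segment between the base sample and the neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class with four 2-D points.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote(minority, n_synthetic=6, k=2, seed=42)
```

Every generated point lies on a segment joining two existing minority samples, so the new data stays inside the region already occupied by the minority class.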
SOUL: Scala Oversampling and Undersampling Library for imbalance classification
This work has been supported by the research project TIN2017-89517-P, by the UGR research contract OTRI 3940 and by a research scholarship, given to the authors Nestor Rodriguez and David Lopez by the University of Granada, Spain.
The improvements in technology and computation have promoted a global adoption of Data Science.
It is devoted to extracting significant knowledge from high amounts of information by means of the
application of Artificial Intelligence and Machine Learning tools. Among the different tasks within Data
Science, classification is probably the most widespread overall.
Focusing on the classification scenario, we often face some datasets in which the number of
instances for one of the classes is much lower than that of the remaining ones. This issue is known as
the imbalanced classification problem, and it is mainly related to the need for boosting the recognition
of the minority class examples.
In spite of a large number of solutions that were proposed in the specialized literature to address
imbalanced classification, there is a lack of open-source software that compiles the most relevant ones
in an easy-to-use and scalable way. In this paper, we present a novel software approach named
SOUL, which stands for Scala Oversampling and Undersampling Library for imbalanced classification.
The main capabilities of this new library include a large number of different data preprocessing
techniques, efficient execution of these approaches, and a graphical environment to contrast the output
for the different preprocessing solutions.
UGR research contract OTRI 3940; University of Granada, Spain; TIN2017-89517-
Combinatorial Blood Platelets-Derived circRNA and mRNA Signature for Early-Stage Lung Cancer Detection
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms24054881/s1.
Despite the diversity of the liquid biopsy transcriptomic repertoire, numerous studies often
exploit only a single RNA type signature for diagnostic biomarker potential. This frequently results in
insufficient sensitivity and specificity necessary to reach diagnostic utility. Combinatorial biomarker
approaches may offer a more reliable diagnosis. Here, we investigated the synergistic contributions
of circRNA and mRNA signatures derived from blood platelets as biomarkers for lung cancer
detection. We developed a comprehensive bioinformatics pipeline permitting an analysis of platelet-
circRNA and mRNA derived from non-cancer individuals and lung cancer patients. An optimal
selected signature is then used to generate the predictive classification model using a machine learning
algorithm. Using an individual signature of 21 circRNA and 28 mRNA, the predictive models
reached an area under the curve (AUC) of 0.88 and 0.81, respectively. Importantly, combinatorial
analysis including both types of RNAs resulted in an 8-target signature (6 mRNA and 2 circRNA),
enhancing the differentiation of lung cancer from controls (AUC of 0.92). Additionally, we identified
five biomarkers potentially specific for early-stage detection of lung cancer. Our proof-of-concept
study presents the first multi-analyte-based approach for the analysis of platelets-derived biomarkers,
providing a potential combinatorial diagnostic signature for lung cancer detection.
European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie 765492
Digital multiplexed analysis of circular RNAs in FFPE and fresh non-small cell lung cancer specimens
We would like to thank Stephanie Davis for her language editing assistance. The investigators also wish to thank the patients for kindly agreeing to donate samples to this study. We thank all the physicians who collaborated by providing clinical information. Graphical Abstract, Figs 1A, 8A and Fig. S1 were created with Biorender.com. This project has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement ELBA No 765492.
Although many studies highlight the implication of circular RNAs (circRNAs)
in carcinogenesis and tumor progression, their potential as cancer
biomarkers has not yet been fully explored in the clinic due to the limitations
of current quantification methods. Here, we report the use of the
nCounter platform as a valid technology for the analysis of circRNA
expression patterns in non-small cell lung cancer (NSCLC) specimens.
Under this context, our custom-made circRNA panel was able to detect
circRNA expression both in NSCLC cells and formalin-fixed paraffin-embedded
(FFPE) tissues. CircFUT8 was overexpressed in NSCLC, contrasting
with circEPB41L2, circBNC2, and circSOX13 downregulation even
at the early stages of the disease. Machine learning (ML) approaches from
different paradigms allowed discrimination of NSCLC from nontumor controls
(NTCs) with an 8-circRNA signature. An additional 4-circRNA signature
was able to classify early-stage NSCLC samples from NTC,
reaching a maximum area under the ROC curve (AUC) of 0.981. Our
results not only present two circRNA signatures with diagnosis potential
but also introduce nCounter processing following ML as a feasible protocol
for the study and development of circRNA signatures for NSCLC.
European Commission 76549